Transforming Features





Kerry Back

Outliers, scaling, and polynomial features

  • For neural nets and other methods, it is important to have predictors that are
    • on the same scale
    • free of outliers
  • It is also useful to add squares and products of our predictors.
  • We will (i) take care of outliers and scaling, (ii) add squares and products, and (iii) define a machine learning model all within a pipeline.

Neural net example of scaling and outliers

For a neuron with

\[ y = \max(0, b + w_1x_1 + \cdots + w_n x_n)\]

  • to find the right \(w\)’s, it helps to have \(x\)’s of similar scales
  • multiplying an outlier by a weight \(w\) can produce an outlier \(y\)
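
To make both bullets concrete, here is a minimal NumPy sketch of a single ReLU neuron (the numbers are made up for the demo):

```python
import numpy as np

# a single ReLU neuron: y = max(0, b + w.x)
def neuron(x, w, b=0.0):
    return max(0.0, b + np.dot(w, x))

w = np.array([1.0, 1.0])

# features on very different scales: the large-scale feature dominates
# the pre-activation sum, so it is hard to learn a useful weight for
# the small-scale one
x = np.array([0.02, 500.0])            # e.g., a return and a price level
print(neuron(x, w))                    # ~ 500.02

# an outlier in one input propagates straight into an outlier output
x_outlier = np.array([0.02, 50_000.0])
print(neuron(x_outlier, w))            # ~ 50000.02
```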

Quantile transformer

There are many ways to take care of outliers and scaling, but we’ll just use one.

from sklearn.preprocessing import QuantileTransformer

transform = QuantileTransformer(
    output_distribution="normal"
)
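
As a quick illustration on synthetic data (the right-skewed lognormal draws here are made up for the demo):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# made-up right-skewed feature with large outliers
rng = np.random.default_rng(0)
x = np.exp(rng.normal(size=(1000, 1)))

qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000)
z = qt.fit_transform(x)

# each value is replaced by the normal quantile of its rank, so the
# transformed column is approximately standard normal: outliers are
# pulled in and every feature ends up on the same scale
print(round(float(z.mean()), 2), round(float(z.std()), 2))
```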

Example: roeq in 2021-01

[Figure: distribution of roeq before ("old") and after ("new") the quantile transformation]

Pipelines

  • This will be our process:
    • Apply quantile transformer
    • Add squares and products
    • Apply quantile transformer again
  • We do this and define our ML model in a pipeline.
  • Then we fit the pipeline and predict with it.

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

poly = PolynomialFeatures(degree=2)

# use a fresh QuantileTransformer for each scaling step: Pipeline fits
# its steps in place, so reusing the single transform instance would
# leave it fitted on the polynomial output and break later predictions
transform2 = QuantileTransformer(output_distribution="normal")
pipe = make_pipeline(
  transform, 
  poly,
  transform2,
  model     # the ML model, defined below
)
pipe.fit(X, y)

Entire workflow: connect to database

from sqlalchemy import create_engine
import pymssql  # driver referenced in the connection string below
import pandas as pd

server = "mssql-82792-0.cloudclusters.net:16272"
username = "user"
password = "" # paste password between quote marks
database = "ghz"

string = f"mssql+pymssql://{username}:{password}@{server}/{database}"

conn = create_engine(string).connect()

Download data

data = pd.read_sql(
    """
    select ticker, date, ret, roeq, mom12m
    from data
    where date='2021-01'
    """, 
    conn
)
data = data.dropna()
data['rnk'] = data.ret.rank(pct=True)
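
A quick illustration of what rank(pct=True) produces (the returns here are made up):

```python
import pandas as pd

# rank(pct=True) maps each return to its percentile within the month,
# so the target is (nearly) uniform on (0, 1] whatever the raw returns look like
ret = pd.Series([0.05, -0.02, 0.10, 0.01])
print(ret.rank(pct=True).tolist())  # [0.75, 0.25, 1.0, 0.5]
```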

Import from scikit-learn

from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline

Define pipeline

transform = QuantileTransformer(
    output_distribution="normal"
)
# a second, separate instance: Pipeline fits its steps in place, so the
# two scaling steps must not share one transformer object
transform2 = QuantileTransformer(
    output_distribution="normal"
)
poly = PolynomialFeatures(degree=2)
model = MLPRegressor(
  hidden_layer_sizes=(4, 2),
  random_state=0
)
pipe = make_pipeline(
  transform, 
  poly,
  transform2,
  model
)

Fit and save the pipeline

X = data[["roeq", "mom12m"]]
y = data["rnk"]

pipe.fit(X, y)

from joblib import dump, load
dump(pipe, "net2.joblib")


Later:

pipe = load("net2.joblib")  # the loaded object is the entire pipeline, not just the net
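
Here is a self-contained sketch of the save/load round trip on synthetic data (the dataset, temp-file path, n_quantiles, and max_iter settings are made up for the demo):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, QuantileTransformer

# made-up stand-in for the (roeq, mom12m) features and the rnk target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.uniform(size=200)

pipe = make_pipeline(
    QuantileTransformer(output_distribution="normal", n_quantiles=200),
    PolynomialFeatures(degree=2),
    QuantileTransformer(output_distribution="normal", n_quantiles=200),
    MLPRegressor(hidden_layer_sizes=(4, 2), random_state=0, max_iter=2000),
)
pipe.fit(X, y)

# save, reload, and check the round trip
path = os.path.join(tempfile.gettempdir(), "net2.joblib")
dump(pipe, path)
reloaded = load(path)
print(np.allclose(pipe.predict(X), reloaded.predict(X)))  # True
```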

Comments

  • The same workflow works for a random forest: import RandomForestRegressor and assign it in the model = line instead of MLPRegressor.
  • In the next section, we will change the last block: we will put the pipeline through GridSearchCV and then fit it.
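
A sketch of the random-forest version of the pipeline (default RandomForestRegressor settings assumed):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, QuantileTransformer

# identical pipeline; only the final estimator changes
model = RandomForestRegressor(random_state=0)
pipe = make_pipeline(
    QuantileTransformer(output_distribution="normal"),
    PolynomialFeatures(degree=2),
    QuantileTransformer(output_distribution="normal"),
    model,
)
print([name for name, _ in pipe.steps])
```

A forest is insensitive to monotone transformations of individual features, so the scaling steps matter less here than for the neural net, but keeping them lets the rest of the workflow stay unchanged.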